In baseball, as in most competitive sports, the way to win games is to score more runs than your opponent. In baseball, that means hitting the baseball so that the batter reaches base and eventually crosses home plate, either on his own via a home run or via a string of hits, before the defensive fielders can record three outs in an inning.
Of the major American sports, baseball has long been the most statistically approachable, with statistics incorporated into roster construction for optimal performance. Within the last 10 years, stadium technology has skyrocketed, recording ever more data about in-game behavior for teams to convert into productive strategies on and off the field. Teams with a strong ability to score and to limit opponent scoring are often the highest-achieving teams in any given season. Scoring opportunities are maximized when there are runners on base that a batter can send to home plate by recording a hit.
The goal of this analysis is to consolidate data on contacted baseballs, identify the best model for predicting the outcome of a contacted ball, and determine which variables contribute most to a higher hit probability on any given ball hit into play. This project focuses on MLB (Major League Baseball) games during the 2022 season at Wrigley Field, where the Chicago Cubs play their home games.
The statistical data used in this project comes from pybaseball, a Python package for baseball data analysis. The following comes from Pybaseball's GitHub page:
This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don't have to. The package retrieves statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods. See the docs for a comprehensive list of data acquisition functions.
The variable descriptions for the initial, imported dataset (before preprocessing) can be found here. After preprocessing, a dataframe displaying the final variable descriptions can be found in the Column Descriptions sub-section later in the notebook.
import sklearn
sklearn.__version__
'1.2.2'
RANDOM_STATE = 95
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, balanced_accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning, FitFailedWarning
simplefilter('ignore', category = ConvergenceWarning)
simplefilter('ignore', category = FitFailedWarning)
simplefilter('ignore', category = UserWarning)
#install pybaseball module
!pip install pybaseball
import pybaseball
from pybaseball import statcast, spraychart
Enable the caching function for large, repeated queries:
pybaseball.cache.enable()
Import data from pybaseball:
every_pitch_2022 = statcast(start_dt = '2022-04-07', end_dt = '2022-10-02', #date period for 2022 regular season
verbose = False)
every_pitch_2022.head()
| | pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | fld_score | post_away_score | post_home_score | post_bat_score | post_fld_score | if_fielding_alignment | of_fielding_alignment | spin_axis | delta_home_win_exp | delta_run_exp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2747 | FF | 2022-10-02 | 93.5 | 1.7 | 5.95 | Zastryzny, Rob | 663993 | 642239 | strikeout | foul_tip | ... | 8 | 3 | 8 | 3 | 8 | Standard | Standard | 167 | 0.0 | -0.083 |
| 2839 | CU | 2022-10-02 | 80.9 | 2.06 | 5.59 | Zastryzny, Rob | 663993 | 642239 | None | ball | ... | 8 | 3 | 8 | 3 | 8 | Standard | Standard | 273 | 0.0 | 0.014 |
| 3001 | FF | 2022-10-02 | 92.3 | 1.65 | 5.93 | Zastryzny, Rob | 663993 | 642239 | None | foul | ... | 8 | 3 | 8 | 3 | 8 | Standard | Standard | 157 | 0.0 | 0.0 |
| 3126 | FF | 2022-10-02 | 94.3 | 1.65 | 5.93 | Zastryzny, Rob | 663993 | 642239 | None | ball | ... | 8 | 3 | 8 | 3 | 8 | Standard | Standard | 159 | 0.0 | 0.011 |
| 3162 | FF | 2022-10-02 | 93.4 | 1.81 | 5.93 | Zastryzny, Rob | 663993 | 642239 | None | called_strike | ... | 8 | 3 | 8 | 3 | 8 | Standard | Standard | 156 | 0.0 | -0.023 |
5 rows × 92 columns
every_pitch_2022.shape
(695196, 92)
Limit observations to pitches at Wrigley Field, that are hit into play, and are not bunt attempts:
wrigley_2022 = every_pitch_2022[(every_pitch_2022['home_team'] == 'CHC')] #limits observations to pitches at Wrigley Field
#every pitch hit into play
wrigley_2022 = wrigley_2022[(wrigley_2022['type'] == 'X')]
#drop all pitches that are categorized as sacrifice bunts, sacrifice bunt double plays, or general bunt attempts
wrigley_2022.drop(wrigley_2022[(wrigley_2022['events'] == 'sac_bunt') |
                               wrigley_2022['des'].str.contains('bunt', na = False)].index,
inplace = True)
wrigley_2022_newCat = wrigley_2022.copy()
#rename columns
wrigley_2022_newCat.rename(columns = {'if_fielding_alignment': 'if_alignment', 'of_fielding_alignment': 'of_alignment',
'outs_when_up':'batter_outs', 'inning_topbot':'inning_half'},
inplace = True)
#new target variable
positive_outcomes = ['single', 'double', 'home_run', 'triple']
wrigley_2022_newCat['outcome'] = np.where(wrigley_2022_newCat['events'].isin(positive_outcomes), 'Hit', 'Non-hit')
#new hit_destination variable
wrigley_2022_newCat['hit_location'] = wrigley_2022_newCat['hit_location'].fillna(0) #set null values to 0
hit_loc_positions = ['seats', 'P', 'C', '1B', '2B', '3B', 'SS', 'LF', 'CF', 'RF']
for i in range(10): #create new variables with choice labels^
wrigley_2022_newCat.loc[wrigley_2022_newCat['hit_location'] == i, 'hit_destination'] = hit_loc_positions[i]
#new pitch_count variable
for x in [0,1, 2, 3]:
for y in [0, 1, 2]:
wrigley_2022_newCat.loc[((wrigley_2022_newCat['balls'] == x) & (wrigley_2022_newCat['strikes'] == y)),
'pitch_count'] = '{}-{}'.format(x,y)
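The nested loops above can also be expressed as a single vectorized string concatenation; a minimal sketch on a toy frame (the `balls`/`strikes` column names match the notebook's, but the frame itself is made up):

```python
import pandas as pd

# toy stand-in for wrigley_2022_newCat
df = pd.DataFrame({'balls': [0, 3, 2], 'strikes': [0, 2, 1]})

# build the 'balls-strikes' count label in one step instead of nested loops
df['pitch_count'] = df['balls'].astype(str) + '-' + df['strikes'].astype(str)

print(df['pitch_count'].tolist())  # ['0-0', '3-2', '2-1']
```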
#new contact_type variable
wrigley_2022_newCat = wrigley_2022_newCat[wrigley_2022_newCat['launch_speed_angle'].notna()] #remove launch_speed_angle_nulls
contact_types = ['Weak', 'Topped', 'Under', 'Flare/Burner', 'Solid', 'Barrel']
for i in range(6): #create new variables with type labels^
wrigley_2022_newCat.loc[wrigley_2022_newCat['launch_speed_angle'] == (i+1), 'contact_type'] = contact_types[i]
#new score differential variables
wrigley_2022_newCat['run_diff'] = wrigley_2022_newCat['bat_score'] - wrigley_2022_newCat['fld_score']
#binarize baserunning variables
for item in ['on_1b', 'on_2b', 'on_3b']:
wrigley_2022_newCat[item] = wrigley_2022_newCat[item].fillna(0)
wrigley_2022_newCat.loc[wrigley_2022_newCat[item] > 0, item] = 1
wrigley_2022_newCat[item] = np.where(wrigley_2022_newCat[item] == 1, 'yes', 'no')
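The three steps above (fill nulls, binarize, relabel) collapse into one expression; a sketch on a toy column, assuming the raw `on_1b` values are runner IDs when the base is occupied and null otherwise, as Statcast reports them:

```python
import numpy as np
import pandas as pd

# toy stand-in: made-up runner IDs when first base is occupied, NaN when empty
df = pd.DataFrame({'on_1b': [501234.0, np.nan, 660271.0]})

# occupied base -> 'yes', empty base -> 'no', in a single step
df['on_1b'] = np.where(df['on_1b'].notna(), 'yes', 'no')

print(df['on_1b'].tolist())  # ['yes', 'no', 'yes']
```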
#new month variables
wrigley_2022_newCat['game_month'] = wrigley_2022_newCat['game_date'].dt.month
Definitions for every variable in the original dataset
Identifiers: variables that label players and teams
Estimations: variables derived from other variables that are too highly correlated with the target variable
Deprecated: null variables that have been replaced by newer ones but retained for older observations
Descriptions: variables that describe the play
Score: variables that reveal the score at different points during the pitch
Converted: new representations of these variables exist
Dummy: categorical variables that will be converted to dummy variables soon
id_variables = ['player_name', 'batter', 'pitcher', 'game_type', 'game_year', 'home_team', 'away_team', 'fielder_2',
'umpire', 'sv_id', 'game_pk','pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
'fielder_7', 'fielder_8', 'fielder_9']
estimation_variables = ['estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom',
'babip_value', 'iso_value', 'delta_home_win_exp', 'delta_run_exp']
deprecated_variables = ['spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated',
'tfs_deprecated', 'tfs_zulu_deprecated']
description_variables = ['description', 'des', 'type', 'events', 'pitch_name','at_bat_number', 'pitch_number']
score_variables = ['game_date', 'home_score', 'away_score', 'bat_score', 'fld_score', 'post_home_score',
'post_away_score', 'post_bat_score','post_fld_score']
converted_variables = ['hit_location', 'launch_speed_angle', 'balls', 'strikes']
wrigley_2022_trimmedCat = wrigley_2022_newCat.copy() #new copy of dataset
wrigley_2022_trimmedCat.drop(columns =
(id_variables + estimation_variables + deprecated_variables + description_variables +
score_variables + converted_variables),
inplace = True)
Drop observations with null values from dataset:
wrigley_2022_trimmedCat.dropna(inplace = True)
Convert categorical/ordinal variables to appropriate data type:
dummy_vars = ['pitch_type', 'zone', 'bb_type', 'batter_outs', 'inning', 'if_alignment', 'of_alignment', 'hit_destination',
'pitch_count', 'contact_type', 'game_month', 'stand', 'p_throws', 'inning_half', 'on_1b', 'on_2b', 'on_3b']
dummy_dict = {}
for item in dummy_vars:
dummy_dict[item] = 'category'
wrigley_2022_trimmedCat = wrigley_2022_trimmedCat.astype(dummy_dict)
#move target variable to last column position
wrigley_2022_trimmedCat.insert(loc = wrigley_2022_trimmedCat.shape[1]-1,
column = 'outcome',
value = wrigley_2022_trimmedCat.pop('outcome'))
| Variable | Description |
|---|---|
| pitch_type | The type of pitch derived from Statcast |
| release_speed | Pitch velocities from Statcast, which are reported out-of-hand |
| release_pos_x | Horizontal Release Position of the ball measured in feet from the catcher's perspective |
| release_pos_z | Vertical Release Position of the ball measured in feet from the catcher's perspective |
| zone | Zone location of the ball when it crosses the plate from the catcher's perspective |
| bb_type | Batted ball type, ground_ball, line_drive, fly_ball, popup |
| pfx_x | Horizontal movement in feet from the catcher's perspective |
| pfx_z | Vertical movement in feet from the catcher's perspective |
| plate_x | Horizontal position of the ball when it crosses home plate from the catcher's perspective |
| plate_z | Vertical position of the ball when it crosses home plate from the catcher's perspective |
| on_3b | Whether 3rd base is occupied or empty |
| on_2b | Whether 2nd base is occupied or empty |
| on_1b | Whether 1st base is occupied or empty |
| batter_outs | Pre-pitch number of outs |
| inning | Pre-pitch inning number |
| hc_x | Hit coordinate X of batted ball |
| hc_y | Hit coordinate Y of batted ball |
| vx0 | The velocity of the pitch, in feet per second, in x-dimension, determined at y=50 feet |
| vy0 | The velocity of the pitch, in feet per second, in y-dimension, determined at y=50 feet |
| vz0 | The velocity of the pitch, in feet per second, in z-dimension, determined at y=50 feet |
| ax | The acceleration of the pitch, in feet per second per second, in x-dimension, determined at y=50 feet |
| ay | The acceleration of the pitch, in feet per second per second, in y-dimension, determined at y=50 feet |
| az | The acceleration of the pitch, in feet per second per second, in z-dimension, determined at y=50 feet |
| sz_top | Top of the batter's strike zone set by the operator when the ball is halfway to the plate |
| sz_bot | Bottom of the batter's strike zone set by the operator when the ball is halfway to the plate |
| hit_distance_sc | Projected hit distance of the batted ball |
| launch_speed | Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included |
| launch_angle | Launch angle of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included |
| effective_speed | Derived speed based on the extension of the pitcher's release |
| release_spin_rate | Spin rate of pitch tracked by Statcast |
| release_extension | Release extension of pitch in feet as tracked by Statcast |
| release_pos_y | Release position of pitch measured in feet from the catcher's perspective |
| if_alignment | Infield fielding alignment at the time of the pitch |
| of_alignment | Outfield fielding alignment at the time of the pitch |
| spin_axis | The Spin Axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball |
| hit_destination | Position of first fielder to touch the ball |
| pitch_count | Number of balls and strikes in at-bat count |
| contact_type | Quality of contact |
| run_diff | Difference between batter team score and fielding team score |
| game_month | Month that the game occurs |
| stand | Side of the plate the batter stands on |
| p_throws | Hand the pitcher throws with |
| inning_half | Pre-pitch top or bottom of inning |
| outcome | TARGET VARIABLE: Whether the batted ball resulted in a hit or non-hit |
Descriptive statistics for categorical variables:
wrigley_2022_trimmedCat.describe(include = 'category').T
| | count | unique | top | freq |
|---|---|---|---|---|
| pitch_type | 3937 | 12 | FF | 1087 |
| zone | 3937 | 13 | 5 | 581 |
| stand | 3937 | 2 | R | 2575 |
| p_throws | 3937 | 2 | R | 2930 |
| bb_type | 3937 | 4 | ground_ball | 1774 |
| on_3b | 3937 | 2 | no | 3561 |
| on_2b | 3937 | 2 | no | 3224 |
| on_1b | 3937 | 2 | no | 2731 |
| batter_outs | 3937 | 3 | 0 | 1339 |
| inning | 3937 | 11 | 5 | 475 |
| inning_half | 3937 | 2 | Top | 2000 |
| if_alignment | 3937 | 3 | Standard | 2465 |
| of_alignment | 3937 | 2 | Standard | 3770 |
| hit_destination | 3937 | 10 | CF | 623 |
| pitch_count | 3937 | 12 | 0-0 | 607 |
| contact_type | 3937 | 6 | Topped | 1367 |
| game_month | 3937 | 7 | 6 | 856 |
Descriptive statistics for numeric variables:
wrigley_2022_trimmedCat.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| release_speed | 3937.0 | 88.290831 | 6.188933 | 38.4 | 84.2 | 89.9 | 92.9 | 100.9 |
| release_pos_x | 3937.0 | -0.899703 | 1.845292 | -4.65 | -2.15 | -1.5 | 0.54 | 4.7 |
| release_pos_z | 3937.0 | 5.71208 | 0.564362 | 1.01 | 5.47 | 5.76 | 6.07 | 7.28 |
| pfx_x | 3937.0 | -0.170996 | 0.861423 | -2.02 | -0.94 | -0.26 | 0.52 | 1.82 |
| pfx_z | 3937.0 | 0.586175 | 0.688991 | -1.71 | 0.16 | 0.68 | 1.13 | 2.22 |
| plate_x | 3937.0 | -0.02969 | 0.538556 | -1.83 | -0.42 | -0.04 | 0.37 | 1.76 |
| plate_z | 3937.0 | 2.324201 | 0.563606 | 0.57 | 1.93 | 2.31 | 2.71 | 4.3 |
| hc_x | 3937.0 | 125.195712 | 39.062556 | 12.3 | 102.07 | 123.79 | 152.06 | 235.06 |
| hc_y | 3937.0 | 125.69619 | 39.541513 | 15.75 | 96.35 | 131.57 | 157.21 | 242.11 |
| vx0 | 3937.0 | 2.528488 | 5.34021 | -13.615462 | -0.903108 | 4.049747 | 6.29278 | 15.38742 |
| vy0 | 3937.0 | -128.452147 | 8.982594 | -146.55537 | -135.097761 | -130.772166 | -122.573969 | -54.717227 |
| vz0 | 3937.0 | -3.392288 | 2.675112 | -10.470656 | -5.275082 | -3.644232 | -1.841158 | 12.555114 |
| ax | 3937.0 | -2.798256 | 10.2554 | -28.990469 | -11.793605 | -2.98619 | 5.046433 | 22.356139 |
| ay | 3937.0 | 26.284862 | 3.799859 | 4.451845 | 23.661421 | 26.302607 | 29.029128 | 42.086975 |
| az | 3937.0 | -24.499197 | 8.190332 | -47.297121 | -30.210264 | -24.060181 | -17.806268 | -6.888317 |
| sz_top | 3937.0 | 3.347107 | 0.174057 | 2.77 | 3.24 | 3.34 | 3.46 | 3.83 |
| sz_bot | 3937.0 | 1.572469 | 0.095198 | 1.22 | 1.52 | 1.57 | 1.64 | 1.85 |
| hit_distance_sc | 3937.0 | 156.254001 | 133.289945 | 0.0 | 16.0 | 140.0 | 279.0 | 452.0 |
| launch_speed | 3937.0 | 88.0968 | 14.275153 | 12.7 | 79.4 | 90.4 | 99.1 | 116.2 |
| launch_angle | 3937.0 | 11.106934 | 28.732692 | -84.0 | -7.0 | 12.0 | 29.0 | 88.0 |
| effective_speed | 3937.0 | 88.40348 | 6.415586 | 36.4 | 84.4 | 90.0 | 93.1 | 101.3 |
| release_spin_rate | 3937.0 | 2221.891288 | 337.072637 | 675.0 | 2040.0 | 2250.0 | 2436.0 | 3389.0 |
| release_extension | 3937.0 | 6.321946 | 0.44806 | 4.0 | 6.0 | 6.3 | 6.6 | 8.0 |
| release_pos_y | 3937.0 | 54.177557 | 0.446402 | 52.54 | 53.89 | 54.15 | 54.48 | 56.53 |
| spin_axis | 3937.0 | 177.386081 | 70.718843 | 2.0 | 136.0 | 201.0 | 222.0 | 356.0 |
| run_diff | 3937.0 | -0.000508 | 3.316855 | -21.0 | -1.0 | 0.0 | 1.0 | 21.0 |
As expected, the majority of these numeric variables are not normally distributed, and any outliers that appear can be justified within the context of baseball.
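One quick way to surface the outliers being referenced is the common 1.5×IQR rule; a sketch on a toy series (the cutoff is a convention, not a claim about the real data):

```python
import pandas as pd

# toy stand-in for a numeric column such as launch_speed
speeds = pd.Series([12.7, 79.4, 90.4, 95.0, 99.1, 116.2])

q1, q3 = speeds.quantile(0.25), speeds.quantile(0.75)
iqr = q3 - q1

# values beyond 1.5*IQR of the quartiles are flagged for review
flagged = speeds[(speeds < q1 - 1.5 * iqr) | (speeds > q3 + 1.5 * iqr)]
print(flagged.tolist())
```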
Pearson correlation heatmap:
#specify size of heatmap
fig, ax = plt.subplots(figsize=(20, 20))
#create heatmap
ax = sns.heatmap(wrigley_2022_trimmedCat.corr(numeric_only = True), annot = True, vmin = -1, cmap = 'PiYG');
The following variable pairs are highly correlated (>|0.9|):
corr_abs = wrigley_2022_trimmedCat.corr(numeric_only = True).abs() #absolute value correlations
corr_pairs = corr_abs.unstack() #unstack dataframe
corr_sorted = corr_pairs.sort_values(kind="quicksort", ascending = False) #sort values
#double inequality for values greater than 0.9
corr_sorted_filtered = corr_sorted[(corr_sorted > 0.9) & (corr_sorted < 1)]
#dataframe of top absolute value correlations
pd.DataFrame(data = {'Variable Pairs': corr_sorted_filtered.iloc[::2].index.tolist(),
'Pearson correlations': corr_sorted_filtered.iloc[::2].values},
index = range(1, len(corr_sorted_filtered.iloc[::2])+1))
| | Variable Pairs | Pearson correlations |
|---|---|---|
| 1 | (release_speed, vy0) | 0.999840 |
| 2 | (release_pos_y, release_extension) | 0.997869 |
| 3 | (vy0, effective_speed) | 0.994013 |
| 4 | (effective_speed, release_speed) | 0.993635 |
| 5 | (pfx_x, ax) | 0.982778 |
| 6 | (pfx_z, az) | 0.981021 |
| 7 | (vx0, release_pos_x) | 0.936242 |
These variable pairs are strongly correlated in either direction because they mostly have causal relationships (i.e., one variable is involved in the calculation of the other).
Scatterplot of Exit Velocity vs Launch Angle:
plt.figure(figsize = (10, 5))
temp_scatter = plt.scatter(x = wrigley_2022_trimmedCat['launch_speed'],
y = wrigley_2022_trimmedCat['launch_angle'],
c = wrigley_2022_trimmedCat['outcome'].astype('category').cat.codes)
plt.title('Launch Speed vs Launch Angle\nBy Outcome')
plt.xlabel('Launch Speed (mph)')
plt.ylabel('Launch Angle (degrees)')
plt.legend(handles = temp_scatter.legend_elements()[0],
labels = ['hits', 'outs'],
prop={'size': 10});
While the correlation coefficient for launch_speed and launch_angle is only 0.109, there is more to the scatterplot when the data is separated by target label. There is clearly a sweet spot in terms of launch angle that leads to hits. If a ball is hit too high in the air (high launch angle), the fielders will have enough time to react and catch the ball before it falls for a hit. Likewise, if a batter does not hit the ball high enough, the ball will hit the ground quickly and be captured by an infielder and thrown to first base for an out. As indicated in the plot, there is generally a wider range of outcomes when the ball is hit hard enough (high launch speed), regardless of angle. The few hits with negative launch angle and slower launch speed were balls hit into the ground immediately that moved slowly enough for the batter to reach first base before a fielder could capture the ball.
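The sweet-spot claim can also be checked numerically by binning launch angle and computing the hit rate per bin; a sketch on a toy frame (the bin edges and rows are illustrative, not the notebook's data):

```python
import pandas as pd

# toy stand-in for wrigley_2022_trimmedCat
df = pd.DataFrame({'launch_angle': [-20, 5, 12, 25, 28, 55, 70],
                   'outcome': ['Non-hit', 'Hit', 'Hit', 'Hit', 'Hit', 'Non-hit', 'Non-hit']})

# bin launch angle, then compute the proportion of hits within each bin
bins = pd.cut(df['launch_angle'], bins=[-90, 0, 10, 35, 90])
hit_rate = df.groupby(bins, observed=True)['outcome'].apply(lambda s: (s == 'Hit').mean())
print(hit_rate)
```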
Barchart of Outcomes by Pitch-Type:
print(wrigley_2022[['pitch_type', 'pitch_name']].value_counts())
#out percentages
temp_outs = []
for i in (list(range(0, 5, 2)) + [7] + list(range(8, 23, 2))): #indices with non-hit percentages
temp_outs.append(wrigley_2022_trimmedCat.groupby(['pitch_type'])['outcome'].value_counts(normalize = True).values[i])
#hit percentages
temp_hits = []
for i in (list(range(1, 6, 2)) + [6] + list(range(9, 24, 2))): #indices with hit percentages
temp_hits.append(wrigley_2022_trimmedCat.groupby(['pitch_type'])['outcome'].value_counts(normalize = True).values[i])
temp_index = ['Sweeper', 'Curveball', 'Slurve', 'Changeup', 'Slider', 'Sinker', '4-Seam\nFastball', 'Knuckle\nCurve',
'Cutter', 'Split\nFinger', 'Eephus', 'Other']
temp_index.reverse()
pd.DataFrame({'hits':sorted(temp_hits, reverse = True),
'outs':sorted(temp_outs)}, index = temp_index)\
.plot.bar(rot = 0, stacked = True, color = {'outs':'dodgerblue', 'hits':'lightblue'},
title = 'Outcomes by Type of Pitch', figsize = (10,5),
xlabel = 'Pitch Type', ylabel = 'Proportion');
pitch_type  pitch_name
FF          4-Seam Fastball    1099
SI          Sinker              893
SL          Slider              594
CH          Changeup            461
FC          Cutter              379
CU          Curveball           250
KC          Knuckle Curve       112
ST          Sweeper              89
FS          Split-Finger         70
SV          Slurve               28
FA          Other                23
EP          Eephus                4
dtype: int64
In baseball, pitchers throw different types of pitches toward the batter to induce bad contact with the ball, a missed swing, or hesitation to swing at the baseball altogether. The first bar chart indicates that 4-seam fastballs, sinkers, and sliders are thrown most commonly, while eephus pitches, slurves, and split-finger fastballs are thrown more sparingly. The next bar chart shows the proportion of outcomes for each type of pitch thrown. Of pitches thrown relatively often, hitters are most likely to create hits out of contacted split-finger fastballs and least likely to create hits out of contacted sweeper breaking balls.
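The index-based extraction above breaks silently if any pitch type lacks one of the outcome classes; a row-normalized crosstab yields the same proportions keyed by label instead of position. A sketch on a toy frame (same column names, made-up rows):

```python
import pandas as pd

# toy stand-in for wrigley_2022_trimmedCat
df = pd.DataFrame({'pitch_type': ['FF', 'FF', 'SI', 'SI', 'SL'],
                   'outcome': ['Hit', 'Non-hit', 'Hit', 'Hit', 'Non-hit']})

# proportion of hits and non-hits within each pitch type
props = pd.crosstab(df['pitch_type'], df['outcome'], normalize='index')
print(props)
```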
Barchart of types of contact by outcome:
#out percentages
temp_outs = []
for i in ([1, 3] + list(range(4,11,2))):
temp_outs.append(wrigley_2022_trimmedCat.groupby(['contact_type'])['outcome'].value_counts(normalize = True).values[i])
#hit percentages
temp_hits = []
for i in ([0,2] + list(range(5,12,2))):
temp_hits.append(wrigley_2022_trimmedCat.groupby(['contact_type'])['outcome'].value_counts(normalize = True).values[i])
temp_index = ['Under', 'Weak', 'Topped', 'Solid', 'Flare/Burner', 'Barrel']
temp_index.reverse()
pd.DataFrame({'hits':sorted(temp_hits, reverse = True),
'outs':sorted(temp_outs)}, index = temp_index)\
.plot.bar(rot = 0, stacked = True, color = {'outs':'saddlebrown', 'hits':'orange'},
title = 'Outcomes by Type of Contact', figsize = (10,5),
xlabel = 'Contact Type', ylabel = 'Proportion');
The x-axis in the barchart above lists the different types of contact that can result from a batted baseball. The graphic directly below it introduces how each type of contact is defined (each is a certain combination of launch angle and launch speed). As indicated in the barchart, barrels, flare/burners, and solid contact are most likely to result in hits, relative to the rest of the options. The less productive types of contact make it easier for fielders either to catch the baseball before it reaches the ground (under) or to field it quickly enough to throw to the first baseman for an out (topped or weak).
Outcomes by Location On Field:
from pybaseball import spraychart
spraychart(data = wrigley_2022_newCat, team_stadium = 'cubs', title = '2022 at Wrigley Field - Balls In Play', size = 30,
colorby = 'outcome', legend_title = 'Outcome', width = 700, height = 700);
Baseball has 9 defensive positions, included in the image above. These fielders can turn contacted baseballs into outs by catching the ball before it hits the ground, by capturing the ball and throwing to the first baseman, and in a few more ways that aren't relevant to this exercise. Batters ultimately intend to hit the baseball away from fielders in most circumstances.
The pybaseball library includes a spraychart function that can plot the outcomes of batted-ball events using the hc_x and hc_y variables, with the outcome variable differentiating those outcomes. There are 2 distinct zones where most hits occur. The first zone exists toward the top of the spraychart, where home runs clear the outfield wall (each scoring a run for the batter and for every runner on base) and where deep drives land near the wall, at the edge of the outfielders' catching range. The second exists in the shallow outfield, beyond the reach of the infielders but in front of the outfielders. Hits can occur in the infield as well if the balls are hit softly enough that the infielders cannot capture and throw the ball to the first baseman before the batter reaches first base.
Overlayed histogram of Hit Coordinate Y of Batted Ball (by Outcome):
plt.hist(x=wrigley_2022_trimmedCat[wrigley_2022_trimmedCat['outcome'] == 'Hit']['hc_y'],
alpha = 0.5, bins = 20, label = 'Hit', color = 'orange')
plt.hist(x=wrigley_2022_trimmedCat[wrigley_2022_trimmedCat['outcome'] == 'Non-hit']['hc_y'],
alpha = 0.5, bins = 20, label = 'Non-hit', color = 'dodgerblue')
plt.legend(loc = 'upper right')
plt.title('Distribution of Hit coordinate Y (hc_y) of Batted Ball (by Outcome)')
plt.gca().invert_xaxis() #inverts the x-axis
plt.xlabel('hc_y')
plt.ylabel('Count')
plt.show();
This overlayed histogram is a more statistically familiar way of representing the spraychart above. The distribution of non-hit locations is roughly bimodal, while the distribution of hit locations is roughly normal.
Dummy variable conversion and reset index:
wrigley_2022_dummy = wrigley_2022_trimmedCat.copy() #new copy of dataset
wrigley_2022_dummy['outcome'] = np.where(wrigley_2022_dummy['outcome'] == 'Hit', 1, 0) #binarize target variable
wrigley_2022_dummy = pd.get_dummies(wrigley_2022_dummy) #create dummy variables for categorical variables
wrigley_2022_dummy = wrigley_2022_dummy.iloc[::-1].reset_index(drop = True) #reverse row order and reset index
#move target variable to last column position
wrigley_2022_dummy.insert(loc = wrigley_2022_dummy.shape[1]-1,
column = 'outcome',
value = wrigley_2022_dummy.pop('outcome'))
Separate target variable (outcome):
y = wrigley_2022_dummy['outcome']
X = wrigley_2022_dummy.drop(['outcome'], axis = 1)
Split dataset into training, validation, and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = RANDOM_STATE)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5, random_state = RANDOM_STATE)
Normalize the training set and apply the fits to the validation and test sets:
mm_scaler = MinMaxScaler() #initiate the scaler
mm_scaler.fit(X_train)
X_train = mm_scaler.transform(X_train)
X_val = mm_scaler.transform(X_val)
X_test = mm_scaler.transform(X_test)
Generalized classification function that returns a confusion matrix and evaluation metrics:
def generalized_clf(clf, train_dep, train_ind, unseen_dep, unseen_ind, test_set = False):
"""takes a classification model, training data, and validation (or test) data as input,
returns an aesthetic confusion matrix, and prints/returns the following evaluation metrics as a list:
Recall (Sensitivity)
Specificity
Precision
Balanced Accuracy
F1 Score"""
clf.fit(train_dep, train_ind) #fit model to training data
dep_pred_unseen = clf.predict(unseen_dep) #prediction on unseen data
conf_mat = confusion_matrix(unseen_ind, dep_pred_unseen) #confusion matrix
#nicer confusion matrix production
    sequential = ['Blues', 'Greens', 'Oranges', 'Purples', 'Reds'] #sequential color palettes
sns.heatmap(conf_mat, annot=True, cmap = random.choice(sequential), fmt="d") #colors chosen at random
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
#evaluation metrics (based on confusion matrix and module metrics)
tn = conf_mat[0,0] #true negative
fp = conf_mat[0,1] #false positive
fn = conf_mat[1,0] #false negative
tp = conf_mat[1,1] #true positive
unseen_recall = recall_score(unseen_ind, dep_pred_unseen) #unseen recall
unseen_specificity = tn/(tn + fp) #unseen specificity
unseen_precision = precision_score(unseen_ind, dep_pred_unseen) #unseen precision
unseen_balanced_accuracy = balanced_accuracy_score(unseen_ind, dep_pred_unseen) #unseen balanced accuracy
unseen_f1 = f1_score(unseen_ind, dep_pred_unseen) #unseen f1 score
if test_set == False:
print('Metrics on the validation set:\n')
else:
print('Metrics on the test set:\n')
print('Recall (Sensitivity):', round(unseen_recall, 5))
print('Specificity:', round(unseen_specificity, 5))
print('Precision:', round(unseen_precision, 5))
print('Balanced Accuracy:', round(unseen_balanced_accuracy, 5))
print('F1 Score:', round(unseen_f1, 5))
return [round(x, 5) for x in
[unseen_recall, unseen_specificity, unseen_precision, unseen_balanced_accuracy, unseen_f1]]
Generalized hyper-parameter tuning function that returns a confusion matrix and evaluation metrics:
def generalized_tuning_grid(clf, params_space, train_dep, train_ind, unseen_dep, unseen_ind):
"""takes a classification model, parameter grid space, training data, and validation data as input
and returns a classification model with updated hyperparameters from the results of a grid search of the
provided parameter grid space"""
grid_search = GridSearchCV(clf, param_grid = params_space, scoring = 'f1') #use 5-fold cross-validation
grid_search.fit(train_dep, train_ind) # Fit the grid search to training data
for param in grid_search.best_params_.keys(): #print best values for parameters
print(f"Best {param}: {grid_search.best_params_[param]}")
print("Best Score:", round(grid_search.best_score_, 4))
clf.set_params(**grid_search.best_params_) #sets classifier to best parameters from gridsearch
return clf
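As a hypothetical illustration of the grid-search pattern this function wraps (synthetic data and placeholder grid values, not tuned recommendations from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the scaled training split
X_demo, y_demo = make_classification(n_samples=200, random_state=95)

# placeholder grid values, for illustration only
params = {'n_estimators': [50, 100], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=95),
                      param_grid=params, scoring='f1')  #cv defaults to 5-fold
search.fit(X_demo, y_demo)
print(search.best_params_)
```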
Model comparison dataframes for classifiers:
clf_eval_metrics = ['Recall', 'Specificity', 'Precision', 'Balanced Accuracy', 'F1 Score']
all_params_dict = {} #storing parameters from all the models
clf_comp_df = pd.DataFrame(index = clf_eval_metrics)
clf_comp_df_test = pd.DataFrame(index = clf_eval_metrics)
Logistic Regression:
estimator = LogisticRegression(max_iter=1000, random_state = RANDOM_STATE, n_jobs = -2, solver='saga')
all_params_dict['Default LR'] = estimator.get_params()
clf_comp_df['Default LR'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set:

Recall (Sensitivity): 0.77596
Specificity: 0.94363
Precision: 0.86061
Balanced Accuracy: 0.85979
F1 Score: 0.81609
Decision Tree:
estimator = DecisionTreeClassifier(random_state = RANDOM_STATE)
all_params_dict['Default DT'] = estimator.get_params()
clf_comp_df['Default DT'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set:

Recall (Sensitivity): 0.80874
Specificity: 0.90196
Precision: 0.78723
Balanced Accuracy: 0.85535
F1 Score: 0.79784
Random Forest:
estimator = RandomForestClassifier(random_state = RANDOM_STATE)
all_params_dict['Default RF'] = estimator.get_params()
clf_comp_df['Default RF'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set: Recall (Sensitivity): 0.79781 Specificity: 0.95098 Precision: 0.87952 Balanced Accuracy: 0.8744 F1 Score: 0.83668
AdaBoost (Adaptive Boosting):
estimator = AdaBoostClassifier(random_state = RANDOM_STATE)
all_params_dict['Default AB'] = estimator.get_params()
clf_comp_df['Default AB'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set: Recall (Sensitivity): 0.77049 Specificity: 0.96324 Precision: 0.90385 Balanced Accuracy: 0.86686 F1 Score: 0.83186
Gradient Boosting:
estimator = GradientBoostingClassifier(random_state = RANDOM_STATE)
all_params_dict['Default GB'] = estimator.get_params()
clf_comp_df['Default GB'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set: Recall (Sensitivity): 0.77049 Specificity: 0.96814 Precision: 0.91558 Balanced Accuracy: 0.86931 F1 Score: 0.8368
Neural Network (Multi-layer Perceptron):
estimator = MLPClassifier(random_state = RANDOM_STATE, max_iter = 400)
all_params_dict['Default NN'] = estimator.get_params()
clf_comp_df['Default NN'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Metrics on the validation set: Recall (Sensitivity): 0.78142 Specificity: 0.90441 Precision: 0.78571 Balanced Accuracy: 0.84292 F1 Score: 0.78356
clf_comp_df.T.sort_values('F1 Score', ascending = False)
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default GB | 0.77049 | 0.96814 | 0.91558 | 0.86931 | 0.83680 |
| Default RF | 0.79781 | 0.95098 | 0.87952 | 0.87440 | 0.83668 |
| Default AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
| Default LR | 0.77596 | 0.94363 | 0.86061 | 0.85979 | 0.81609 |
| Default DT | 0.80874 | 0.90196 | 0.78723 | 0.85535 | 0.79784 |
| Default NN | 0.78142 | 0.90441 | 0.78571 | 0.84292 | 0.78356 |
The gradient boosting and random forest classifiers performed the best on the validation data, relative to the other default classifiers. F1 score is the primary evaluation metric in this analysis because the distribution of outcomes (hits and non-hits) is imbalanced, and because the goal is to predict outcomes overall rather than to favor one class over the other.
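To illustrate why F1 score is preferred over raw accuracy under class imbalance, the toy confusion-matrix counts below (hypothetical numbers, not from this dataset) show how a classifier that never predicts the minority class can still score high accuracy while F1 exposes it:

```python
# Hypothetical counts for an imbalanced set: 200 hits, 800 non-hits.
# A "never predict hit" classifier gets every non-hit right and every hit wrong.
tp, fn, fp, tn = 0, 200, 0, 800

accuracy = (tp + tn) / (tp + tn + fp + fn)  # dominated by the majority class
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(accuracy)  # 0.8 despite predicting no hits at all
print(f1)        # 0.0, exposing the useless classifier
```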
Type and extent of regularization are two hyperparameters that impact the performance of logistic regression classifiers. There are three types of regularization: L1 (lasso), L2 (ridge), and elastic net (a weighted combination of the two). The penalty hyperparameter represents that choice, while C represents the inverse of regularization strength.
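The penalty terms being grid-searched below can be sketched with toy weights (illustrative only; not scikit-learn's exact internal objective). Smaller C means a stronger penalty, since the penalty term is effectively scaled by 1/C:

```python
# Toy weight vector and the hyperparameters searched below (hypothetical values).
w = [0.5, -2.0, 1.5]
C, l1_ratio = 1.0, 0.5

l1 = sum(abs(wi) for wi in w)                      # lasso: encourages sparse weights
l2 = 0.5 * sum(wi ** 2 for wi in w)                # ridge: shrinks weights smoothly
elasticnet = l1_ratio * l1 + (1 - l1_ratio) * l2   # blend controlled by l1_ratio

penalty_term = elasticnet / C  # added to the log-loss during fitting
print(l1, l2, elasticnet)
```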
#initiate logistic regression classifier
estimator = LogisticRegression(max_iter=1000, random_state = RANDOM_STATE, n_jobs = -2, solver='saga', l1_ratio = 0.5)
#specify the parameter grid for grid search
param_grid = {'penalty': ['elasticnet', 'l1', 'l2', None], # norm of the penalty
'C': [0.001, 0.01, .1, 1, 10]} # inverse of regularization strength
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
all_params_dict['Optimal LR'] = estimator.get_params()
clf_comp_df['Optimal LR'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best C: 10 Best penalty: elasticnet Best Score: 0.8405
Metrics on the validation set: Recall (Sensitivity): 0.75956 Specificity: 0.94853 Precision: 0.86875 Balanced Accuracy: 0.85405 F1 Score: 0.8105
clf_comp_df.T.loc[['Default LR', 'Optimal LR']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default LR | 0.77596 | 0.94363 | 0.86061 | 0.85979 | 0.81609 |
| Optimal LR | 0.75956 | 0.94853 | 0.86875 | 0.85405 | 0.81050 |
The logistic regression classifier regressed slightly after hyperparameter tuning given that the tuned model's F1 score decreased by 0.5 percentage points.
Decision trees have three major hyperparameters (among others) that impact the quality of the model: max_depth (the maximum depth of the tree), min_samples_split (the minimum number of samples required to split an internal node), and min_samples_leaf (the minimum number of samples required at a leaf node).
estimator = DecisionTreeClassifier(random_state = RANDOM_STATE) #initiate decision tree classifier
#specify the parameter grid for grid search
param_grid = {'max_depth': range(1,11), # maximum depth of the tree
'min_samples_split': [20, 25, 30], # minimum number of samples required to split an internal node
'min_samples_leaf': [5, 10, 15]} # minimum number of samples required to be at a leaf node
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
base_dt_clf = estimator
all_params_dict['Optimal DT'] = estimator.get_params()
clf_comp_df['Optimal DT'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best max_depth: 10 Best min_samples_leaf: 10 Best min_samples_split: 25 Best Score: 0.8324
Metrics on the validation set: Recall (Sensitivity): 0.78142 Specificity: 0.94608 Precision: 0.86667 Balanced Accuracy: 0.86375 F1 Score: 0.82184
clf_comp_df.T.loc[['Default DT', 'Optimal DT']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default DT | 0.80874 | 0.90196 | 0.78723 | 0.85535 | 0.79784 |
| Optimal DT | 0.78142 | 0.94608 | 0.86667 | 0.86375 | 0.82184 |
The decision tree classifier improved after hyperparameter tuning given that the tuned model's F1 score increased by 2.4 percentage points.
Please refer to the Decision Tree Optimization section, as these two models share the same hyperparameters:
estimator = RandomForestClassifier(random_state = RANDOM_STATE, n_jobs = -2) #initiate random forest classifier
#specify the parameter grid for grid search
param_grid = {'max_depth': range(1,11), # maximum depth of the tree
'min_samples_split': [20, 25, 30], # minimum number of samples required to split an internal node
'min_samples_leaf': [5, 10, 15]} # minimum number of samples required to be at a leaf node
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
all_params_dict['Optimal RF'] = estimator.get_params()
clf_comp_df['Optimal RF'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best max_depth: 10 Best min_samples_leaf: 5 Best min_samples_split: 20 Best Score: 0.8404
Metrics on the validation set: Recall (Sensitivity): 0.79781 Specificity: 0.94608 Precision: 0.86905 Balanced Accuracy: 0.87195 F1 Score: 0.83191
clf_comp_df.T.loc[['Default RF', 'Optimal RF']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default RF | 0.79781 | 0.95098 | 0.87952 | 0.87440 | 0.83668 |
| Optimal RF | 0.79781 | 0.94608 | 0.86905 | 0.87195 | 0.83191 |
The random forest classifier regressed slightly after hyperparameter tuning given that the tuned model's F1 score decreased by 0.5 percentage points.
The learning rate in an AdaBoost classifier determines the weight applied to each classifier at each boosting iteration. A greater learning rate therefore increases the contribution of each classifier. If the learning rate is too high, however, model performance may decrease because the ensemble applies too much weight to earlier models.
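A rough sketch of how the learning rate scales each weak learner's vote, using a SAMME-style stage weight for a binary problem (illustrative, not scikit-learn's internals; the error value is hypothetical):

```python
import math

# SAMME-style weight of one boosting stage for a binary problem:
# learning_rate scales the log-odds of the weak learner's weighted error.
def stage_weight(error, learning_rate):
    return learning_rate * math.log((1 - error) / error)

err = 0.3  # hypothetical weighted error of one weak learner
for lr in [0.01, 0.1, 1.0]:
    print(lr, round(stage_weight(err, lr), 4))
# A smaller learning_rate shrinks every stage's contribution proportionally,
# so later stages correct earlier ones in smaller steps.
```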
estimator = AdaBoostClassifier(random_state = RANDOM_STATE, estimator = base_dt_clf) #initiate AdaBoost classifier
#specify the parameter grid for grid search
param_grid = {'learning_rate': [0.001, 0.01, 0.1, 1]} # weight applied to each classifier at each boosting iteration
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
all_params_dict['Optimal AB'] = estimator.get_params()
clf_comp_df['Optimal AB'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best learning_rate: 0.01 Best Score: 0.8446
Metrics on the validation set: Recall (Sensitivity): 0.77049 Specificity: 0.96324 Precision: 0.90385 Balanced Accuracy: 0.86686 F1 Score: 0.83186
clf_comp_df.T.loc[['Default AB', 'Optimal AB']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
| Optimal AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
The AdaBoost classifier yielded the same values after hyperparameter tuning; decreasing the learning rate did not affect the classifier's ability to produce a different meta-estimator.
Please refer to the AdaBoost Optimization section, as these two sections share the same hyperparameter (learning_rate).
estimator = GradientBoostingClassifier(random_state = RANDOM_STATE) #initiate Gradient Boosting classifier
#applying parameters from the base decision tree to the Gradient boosting machine
custom_params = {}
for key in ['max_depth', 'min_samples_split', 'min_samples_leaf']:
    custom_params[key] = base_dt_clf.get_params()[key]
estimator.set_params(**custom_params)
#specify the parameter grid for grid search
param_grid = {'learning_rate': [0.001, 0.01, 0.1, 1]} # weight applied to each classifier at each boosting iteration
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
all_params_dict['Optimal GB'] = estimator.get_params()
clf_comp_df['Optimal GB'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best learning_rate: 0.1 Best Score: 0.8637
Metrics on the validation set: Recall (Sensitivity): 0.78142 Specificity: 0.96569 Precision: 0.91083 Balanced Accuracy: 0.87355 F1 Score: 0.84118
clf_comp_df.T.loc[['Default GB', 'Optimal GB']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default GB | 0.77049 | 0.96814 | 0.91558 | 0.86931 | 0.83680 |
| Optimal GB | 0.78142 | 0.96569 | 0.91083 | 0.87355 | 0.84118 |
The Gradient Boosting classifier improved slightly after hyperparameter tuning given that the tuned model's F1 score increased by 0.5 percentage points.
Two key hyperparameters of a neural network are the structure of its hidden layers and the activation function applied within them. Located between the input and output layers, a hidden layer is where neurons (or perceptrons) accept weighted inputs and produce an output that is then passed to an activation function. The activation function determines the output value of a given node based on the input it receives.
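What one hidden-layer neuron computes can be sketched directly: a weighted sum of its inputs plus a bias, passed through an activation (toy weights and inputs below; illustrative of the candidates in the grid, not MLPClassifier's internals):

```python
import math

# One neuron: weighted sum of inputs plus bias, then an activation function.
x = [0.8, -1.2, 0.5]   # inputs from the previous layer (hypothetical)
w = [0.4, 0.3, -0.9]   # learned weights for this neuron (hypothetical)
b = 0.1                # bias term

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted input

logistic = 1 / (1 + math.exp(-z))  # squashes to (0, 1)
tanh = math.tanh(z)                # squashes to (-1, 1)
relu = max(0.0, z)                 # passes positives, zeroes negatives

print(round(z, 3), round(logistic, 3), round(tanh, 3), round(relu, 3))
```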
estimator = MLPClassifier(random_state = RANDOM_STATE, max_iter = 1000) #initiate multi-layer perceptron classifier
#specify the parameter grid for grid search
param_grid = {'hidden_layer_sizes': [(50, 50), (100, 50), (50, 50, 50)] , # number of neurons in the ith hidden layer
'activation': ['logistic', 'tanh', 'relu']} # activation function for the hidden layer
estimator = generalized_tuning_grid(estimator, param_grid, X_train, y_train, X_val, y_val) #update parameters
all_params_dict['Optimal NN'] = estimator.get_params()
clf_comp_df['Optimal NN'] = generalized_clf(estimator, X_train, y_train, X_val, y_val)
Best activation: logistic Best hidden_layer_sizes: (50, 50, 50) Best Score: 0.8347
Metrics on the validation set: Recall (Sensitivity): 0.7541 Specificity: 0.88235 Precision: 0.74194 Balanced Accuracy: 0.81823 F1 Score: 0.74797
clf_comp_df.T.loc[['Default NN', 'Optimal NN']]
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default NN | 0.78142 | 0.90441 | 0.78571 | 0.84292 | 0.78356 |
| Optimal NN | 0.75410 | 0.88235 | 0.74194 | 0.81823 | 0.74797 |
The Neural Network classifier regressed after hyperparameter tuning given that the tuned model's F1 score decreased by over 3.5 percentage points.
clf_comp_df.T.loc[['Optimal LR', 'Optimal DT', 'Optimal RF', 'Optimal AB','Optimal GB', 'Optimal NN']]\
.sort_values('F1 Score', ascending = False)
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Optimal GB | 0.78142 | 0.96569 | 0.91083 | 0.87355 | 0.84118 |
| Optimal RF | 0.79781 | 0.94608 | 0.86905 | 0.87195 | 0.83191 |
| Optimal AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
| Optimal DT | 0.78142 | 0.94608 | 0.86667 | 0.86375 | 0.82184 |
| Optimal LR | 0.75956 | 0.94853 | 0.86875 | 0.85405 | 0.81050 |
| Optimal NN | 0.75410 | 0.88235 | 0.74194 | 0.81823 | 0.74797 |
After hyperparameter tuning each model, the tuned gradient boosting classifier performed the best on the validation data, relative to the other tuned classifiers. As mentioned, the decision tree showed the greatest improvement, while the neural network regressed the most.
Classifier evaluation metrics on validation set:
clf_comp_df.T.sort_values('F1 Score', ascending = False)
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Optimal GB | 0.78142 | 0.96569 | 0.91083 | 0.87355 | 0.84118 |
| Default GB | 0.77049 | 0.96814 | 0.91558 | 0.86931 | 0.83680 |
| Default RF | 0.79781 | 0.95098 | 0.87952 | 0.87440 | 0.83668 |
| Optimal RF | 0.79781 | 0.94608 | 0.86905 | 0.87195 | 0.83191 |
| Default AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
| Optimal AB | 0.77049 | 0.96324 | 0.90385 | 0.86686 | 0.83186 |
| Optimal DT | 0.78142 | 0.94608 | 0.86667 | 0.86375 | 0.82184 |
| Default LR | 0.77596 | 0.94363 | 0.86061 | 0.85979 | 0.81609 |
| Optimal LR | 0.75956 | 0.94853 | 0.86875 | 0.85405 | 0.81050 |
| Default DT | 0.80874 | 0.90196 | 0.78723 | 0.85535 | 0.79784 |
| Default NN | 0.78142 | 0.90441 | 0.78571 | 0.84292 | 0.78356 |
| Optimal NN | 0.75410 | 0.88235 | 0.74194 | 0.81823 | 0.74797 |
The optimized gradient boosting classifier produced the highest F1 score of every model applied to the validation set. Overall, the gradient boosting classifiers outperformed every other type of classifier, with or without hyperparameter tuning. The second- and third-best classifier types were the random forest and AdaBoost, respectively, although, as previously mentioned, the default classifiers of these two types outperformed their optimized equivalents. Ultimately, the neural networks produced the worst-performing models, with or without tuning.
For final testing, the optimal gradient boosting classifier and the optimal random forest classifier were chosen because their tuned hyperparameters should, in theory, be more robust to unseen data and less prone to overfitting the training data (even though that advantage did not appear on the validation set).
In addition, two voting classifiers will be applied to the test set: one combining the two best optimized models (gradient boosting and random forest) and one combining the two best default models. Each voting classifier uses soft voting, which means the output class for each instance is chosen based on the average probability assigned to that class across the inputted models.
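The averaging that soft voting performs can be sketched with two hypothetical probability vectors (toy numbers, not real model output):

```python
# Soft voting: average each model's class probabilities,
# then pick the class with the highest mean probability.
p_rf = [0.40, 0.60]  # hypothetical [non-hit, hit] probabilities from one model
p_gb = [0.70, 0.30]  # hypothetical probabilities from the other model

avg = [(a + b) / 2 for a, b in zip(p_rf, p_gb)]
predicted_class = avg.index(max(avg))  # 0 = non-hit, 1 = hit

print(avg, predicted_class)  # [0.55, 0.45] -> class 0
```

Note that even though one model favored "hit" here, the ensemble's averaged probability tips the prediction to "non-hit" because the other model was more confident.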
Optimized Gradient Boosting (on test set):
#Optimized Gradient boosting machine on test set
estimator = GradientBoostingClassifier().set_params(**all_params_dict['Optimal GB'])
clf_comp_df_test['Optimal GB'] = generalized_clf(estimator, X_train, y_train, X_test, y_test, test_set = True)
Metrics on the test set: Recall (Sensitivity): 0.79082 Specificity: 0.94937 Precision: 0.88571 Balanced Accuracy: 0.87009 F1 Score: 0.83558
Optimized Random Forest (on test set):
#Optimized random forest classifier on test set
estimator = RandomForestClassifier().set_params(**all_params_dict['Optimal RF'])
clf_comp_df_test['Optimal RF'] = generalized_clf(estimator, X_train, y_train, X_test, y_test, test_set = True)
Metrics on the test set: Recall (Sensitivity): 0.75 Specificity: 0.93924 Precision: 0.85965 Balanced Accuracy: 0.84462 F1 Score: 0.80109
Voting classifier with best optimized models (on test set):
#voting classifier combining the two best optimized classifiers
clf1 = RandomForestClassifier().set_params(**all_params_dict['Optimal RF'])
clf2 = GradientBoostingClassifier().set_params(**all_params_dict['Optimal GB'])
estimator = VotingClassifier(estimators = [('Optimal RF', clf1), ('Optimal GB', clf2)], voting='soft', n_jobs=-2)
clf_comp_df_test['Optimal GB/RF Ensemble'] = generalized_clf(estimator, X_train, y_train, X_test, y_test, test_set = True)
Metrics on the test set: Recall (Sensitivity): 0.78061 Specificity: 0.95949 Precision: 0.90533 Balanced Accuracy: 0.87005 F1 Score: 0.83836
Voting classifier with best default models (on test set):
#voting classifier combining the two best default classifiers
clf1 = GradientBoostingClassifier().set_params(**all_params_dict['Default GB'])
clf2 = RandomForestClassifier().set_params(**all_params_dict['Default RF'])
estimator = VotingClassifier(estimators = [('Default GB', clf1), ('Default RF', clf2)], voting='soft', n_jobs=-2)
clf_comp_df_test['Default GB/RF Ensemble'] = generalized_clf(estimator, X_train, y_train, X_test, y_test, test_set = True)
Metrics on the test set: Recall (Sensitivity): 0.80102 Specificity: 0.95949 Precision: 0.90751 Balanced Accuracy: 0.88026 F1 Score: 0.85095
clf_comp_df_test.T.sort_values('F1 Score', ascending = False)
| Recall | Specificity | Precision | Balanced Accuracy | F1 Score | |
|---|---|---|---|---|---|
| Default GB/RF Ensemble | 0.80102 | 0.95949 | 0.90751 | 0.88026 | 0.85095 |
| Optimal GB/RF Ensemble | 0.78061 | 0.95949 | 0.90533 | 0.87005 | 0.83836 |
| Optimal GB | 0.79082 | 0.94937 | 0.88571 | 0.87009 | 0.83558 |
| Optimal RF | 0.75000 | 0.93924 | 0.85965 | 0.84462 | 0.80109 |
The default gradient boosting and random forest ensemble classifier performed the best on the test set, beating the other test-set models by over 1 percentage point in F1 score. Of the models that were not voting classifiers, the gradient boosting classifier outperformed the random forest. Interestingly, this outcome further confirms how two weaker learners can work together to produce a (slightly) superior model. While it was unexpected for a voting ensemble consisting entirely of default models to perform the best, there are ways to improve the grid search process, including widening the search space itself. Finally, a higher F1 score would be ideal, but roughly 85% is a solid starting point.
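One way to widen the search space without an exhaustive grid is a randomized search over parameter distributions at a fixed budget. A hedged sketch with scikit-learn's RandomizedSearchCV, shown on synthetic data standing in for the real training set (the ranges here are illustrative, not the ones used above):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Sample n_iter parameter combinations from wider ranges instead of
# exhaustively crossing every grid value.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={'max_depth': randint(1, 15),
                         'min_samples_leaf': randint(2, 30),
                         'learning_rate': [0.01, 0.05, 0.1, 0.5]},
    n_iter=5, scoring='f1', random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```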
estimator = GradientBoostingClassifier().set_params(**all_params_dict['Optimal GB'])
estimator.fit(X_train, y_train)
fig = plt.figure(figsize=(20, 10))
importance = pd.Series(estimator.feature_importances_, index=wrigley_2022_dummy.columns[:-1]).sort_values(ascending=False)
sns.barplot(x=importance[:11], y=importance[:11].index)
plt.title('Feature Importance')
plt.xlabel('Score')
plt.show()
The fact that the gradient boosting machine was the highest-performing (non-voting) model permitted an investigation of its most important features. As expected, the type of contact (contact_type) and the location where the ball lands (hc_y) play a huge role in determining whether a contacted ball is a hit or an out. Additional commentary on the top 10 variables:
The bolded statistics were calculated directly below. Overall, the most important features in the dataset match baseball convention and logic.
Hit probability by contact type:
contact_type_dict = {}
for item in wrigley_2022_trimmedCat['contact_type'].unique():
    contact_type_dict[item] = wrigley_2022_trimmedCat[wrigley_2022_trimmedCat['contact_type'] == item]['outcome']\
                              .value_counts(normalize=True)
pd.DataFrame(contact_type_dict).sort_values(by='Hit', axis = 1)
| Under | Weak | Topped | Solid | Flare/Burner | Barrel | |
|---|---|---|---|---|---|---|
| Hit | 0.085896 | 0.121795 | 0.163862 | 0.455752 | 0.665638 | 0.74359 |
| Non-hit | 0.914104 | 0.878205 | 0.836138 | 0.544248 | 0.334362 | 0.25641 |
Hit probability by type of fielder that touches the ball first:
#hit probability if an infielder touches the ball first
infield_hit_chance = wrigley_2022_trimmedCat[(wrigley_2022_trimmedCat['hit_destination'] != 'LF') &
(wrigley_2022_trimmedCat['hit_destination'] != 'CF') &
(wrigley_2022_trimmedCat['hit_destination'] != 'RF') &
(wrigley_2022_trimmedCat['hit_destination'] != 'seats')]['outcome']\
.value_counts(normalize = True)
#hit probability if an outfielder touches the ball first
outfield_hit_chance = wrigley_2022_trimmedCat[(wrigley_2022_trimmedCat['hit_destination'] == 'LF')|
(wrigley_2022_trimmedCat['hit_destination'] == 'CF')|
(wrigley_2022_trimmedCat['hit_destination'] == 'RF')]['outcome']\
.value_counts(normalize = True)
pd.DataFrame(data={'Infielder': infield_hit_chance,
'Outfielder': outfield_hit_chance})
| Infielder | Outfielder | |
|---|---|---|
| Hit | 0.079937 | 0.511678 |
| Non-hit | 0.920063 | 0.488322 |
pybaseball data with this degree of granularity exists as far back as the 2017 season. If more of that data were added to the training sets, better models could be trained for unseen, future data.
scikit-learn's implementation of neural networks is very high-level. Incorporating a TensorFlow/Keras implementation would likely yield better models due to greater customization, although the main drawback of neural networks remains their black-box nature, which prevents the user from deriving feature importances.